Checkpointed Tag Prefetcher

ABSTRACT

A dual-mode prefetch system for implementing checkpoint tag prefetching includes: a data array for storing data fetched from cache memory; a set of cache tags identifying the data stored in the data array; a checkpoint tag array storing data identification information; and a cache controller with prefetch logic.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a division of, and claims priority to, commonly-owned co-pending U.S. patent application Ser. No. 11/736,548, filed on Apr. 17, 2007; which application is incorporated by reference as if fully set forth herein.

STATEMENT REGARDING FEDERALLY SPONSORED-RESEARCH OR DEVELOPMENT

None.

INCORPORATION BY REFERENCE OF MATERIAL SUBMITTED ON A COMPACT DISC

Not Applicable.

FIELD OF THE INVENTION

The invention disclosed broadly relates to the field of cache memory and more particularly relates to the field of tag prefetching of cache memory.

BACKGROUND OF THE INVENTION

Main memory access speed has not increased as fast as the clock speeds of modern microprocessors. As a result, the latency of memory access in terms of processor clock cycles has increased over time. These long latencies result in performance degradation when a processor must access main memory. Computer architects have mitigated this problem through the introduction of one or more caches associated with each processor, such as Level 2 cache, built using SRAM memories whose access latencies are much lower than the latencies of main memory. Caches are small images of a portion of main memory. These caches store the data most recently touched by a processor, and when a processor touches (reads/writes/updates) this data again, its performance is improved by avoiding main memory access stalls. Cache access is much faster than main memory access; therefore, cache access should be maximized. However, cache is more expensive than main memory; consequently, it has less capacity than main memory. The problem becomes how best to maximize cache utilization while working within a small space.

Referring now in specific detail to the drawings, and particularly FIG. 1, there is illustrated a conventional cache implementation 100. Hardware caches, as they are conventionally built by various computer manufacturers, consist of a data array 130 containing the data stored in the cache, a tag array 120 containing meta-data that identifies the set of data currently residing in the data array, and attributes for each data item (e.g. least recently used “LRU” state, coherence permissions), a decoder 140 which is used to map an address to the set of the cache that may contain that address, and a cache controller 110, which is a finite-state machine that controls the refilling of the cache as a response to processor requests that miss in the cache.

Referring to FIG. 2 there is shown a flow chart of the typical flow of a single request performed by the processor, according to the known art. The process begins at step 210 with the processor initiating a request by driving a request address in conjunction with a request signal to the cache's decoder 140. In step 220 the cache decoder 140 determines the set number to which this address corresponds, and drives a signal to that set. In step 230 this signal activates the tag-match logic corresponding to that particular set, which compares the remaining bits of the processor's address to the current contents of that set.

In step 240 in the event of a cache miss, the processing proceeds to step 270 where the cache hit signal is held to zero, and a cache refill is initiated by the cache controller. If, however, there is a cache hit, then in step 250 the cache hit signal is asserted, and a signal is sent to the data array for that cache block. Next, in step 260 the data from the data array is read and sent to the requesting processor. In most cache implementations, the tag metadata contains the upper-order bits of each cache block's physical address.
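The request flow of FIG. 2 can be illustrated with a small software model. The following is a minimal sketch only, not the hardware itself: the block size, number of sets, associativity, the class and method names, and the trivial victim choice are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

BLOCK_BITS = 6  # assumed: 64-byte cache blocks
SET_BITS = 4    # assumed: 16 sets
WAYS = 2        # assumed: 2-way set associative


@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    data: bytes = b""


def _empty_sets():
    return [[Line() for _ in range(WAYS)] for _ in range(1 << SET_BITS)]


@dataclass
class ConventionalCache:
    sets: list = field(default_factory=_empty_sets)

    def decode(self, addr):
        """Decoder (step 220): map an address to (set index, remaining tag bits)."""
        index = (addr >> BLOCK_BITS) & ((1 << SET_BITS) - 1)
        tag = addr >> (BLOCK_BITS + SET_BITS)
        return index, tag

    def lookup(self, addr):
        """Tag match (steps 230-270): return (hit signal, data)."""
        index, tag = self.decode(addr)
        for line in self.sets[index]:
            if line.valid and line.tag == tag:
                return True, line.data  # steps 250-260: hit, data returned
        return False, None              # step 270: miss, a refill would follow

    def refill(self, addr, data):
        """Controller refill; the victim choice here is a placeholder, not LRU."""
        index, tag = self.decode(addr)
        ways = self.sets[index]
        victim = next((line for line in ways if not line.valid), ways[0])
        victim.valid, victim.tag, victim.data = True, tag, data
```

In this model the tag holds the upper-order address bits, so two addresses that map to the same set are distinguished only by the tag comparison.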

The drawback with these caches is that their capacity is small relative to the size of main memory; consequently, their contents must be carefully managed to maximize the probability that a processor's future memory accesses will be available from the cache. Rather than simply retaining the most recently touched data, many processors implement prefetching mechanisms that predict those memory locations that a processor will reference in the future, and preload this data into the cache in preparation for the processor's upcoming demands. These prefetching mechanisms can be categorized as software prefetchers or hardware prefetchers.

Software prefetching is supported by the processor through one or more special prefetch instructions that are inserted into a program's instruction sequence by a programmer, compiler, or run-time system, based on some knowledge of the application's future memory reference pattern. A prefetch instruction causes the processor to preload a memory location into the processor's cache, without stalling the processor while it is being loaded. Unlike software prefetching, hardware prefetchers operate independently of any software control. By monitoring either the pattern of memory accesses being performed by a processor, or the pattern of cache miss requests from a cache, these prefetchers can predict a processor's future memory access pattern, and preload this data without any support from the programmer, compiler, or run-time system.

In many scenarios, the benefits of known hardware and software prefetchers are limited, because a program's reference pattern may be difficult to determine or summarize in a manner amenable to either hardware or software. Also, an application's recent memory reference history may not be indicative of its future memory reference pattern. For example, many applications exhibit phase behavior, in which the application's working set of memory locations and memory access patterns are consistent within one phase of time, but may vary wildly across different phases. These phases may also be periodic, following a predictable sequence of phases (e.g. ABCABCABC, where each letter represents a distinct phase of execution). When phases change, the hardware prefetcher may not remember the reference pattern that occurred during a previous execution of the new phase, and must incur a training period during which its effectiveness is limited.

Therefore, there is a need for a prefetch mechanism to overcome the stated shortcomings of the known art.

SUMMARY OF THE INVENTION

Briefly, according to an embodiment of the invention, a dual-mode prefetching method using checkpoint tags includes the following steps or acts when in normal mode: fetching a cache block from cache memory; writing data from the cache block into a data array; writing an address and metadata of the data into an array of cache tags; and writing at least one identifier of the data into at least one array of checkpoint tags, wherein the at least one array of checkpoint tags is located in a prefetch mechanism.

The method may optionally include steps of receiving a checkpoint request; and switching from normal mode to checkpoint mode. Additionally, the method may include a step of storing a thread-id of a currently-running software application into the at least one array of checkpoint tags.

According to an embodiment of the present invention, a prefetching method in checkpoint mode includes steps or acts of: receiving a save request; fetching at least one cache block from cache memory; writing data from the at least one cache block into a data array; and writing an address and metadata of the data into an array of cache memory tags, the cache memory tags referencing locations of data in the data array. The method continues with receiving a restore request; fetching an identifier for the at least one cache block in at least one checkpoint tag array; reloading the cache memory with the at least one cache block referenced by its identifier in the at least one checkpoint tag array that is not already stored in the cache memory; and switching to normal mode.

A dual-mode prefetch mechanism for implementing checkpoint tag prefetching includes: a data array for storing data fetched from cache memory; a set of cache tags for identifying the data stored in the data array; at least one set of checkpoint tags for storing data identification; and a cache controller including prefetch logic, the prefetch logic including a checkpoint prefetch controller and a checkpoint prefetch operator.

The method can also be implemented as machine executable instructions executed by a programmable information processing system or as hard coded logic in a specialized computing apparatus such as an application-specific integrated circuit (ASIC).

BRIEF DESCRIPTION OF THE DRAWINGS

To describe the foregoing and other exemplary purposes, aspects, and advantages, we use the following detailed description of an exemplary embodiment of the invention with reference to the drawings, in which:

FIG. 1 depicts a conventional cache implementation, according to the known art;

FIG. 2 is a flow chart of the typical flow of a single request in a conventional cache implementation, according to the known art;

FIG. 3 depicts a cache mechanism utilizing a checkpointed tag prefetcher, according to an embodiment of the present invention;

FIG. 4 is a flow chart of the checkpointed tag prefetcher when operating in normal mode, according to an embodiment of the present invention; and

FIG. 5 is a flow chart of the checkpointed tag prefetcher when operating in checkpointed mode, according to an embodiment of the present invention.

While the invention as claimed can be modified into alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention.

DETAILED DESCRIPTION

We describe a prefetching mechanism based on cache tag checkpointing, useful for reducing cache misses in instruction, data, and unified caches. Cache tags (sometimes also known as “directories”) are a portion of the cache memory that labels the contents of the data stored in the cache, so that the contents of the cache may be easily determined. Although very small relative to the overall cache size, the tags can be used to reconstruct the contents of the entire cache. As used in this specification, a checkpoint is a point at which information about the status of a job and/or the system state can be recorded so that the job step can later be restarted in its checkpointed state.

The checkpoint prefetching mechanism described herein is based on the observation that there are many events that will temporarily touch large amounts of data, displacing the application's regular working set. These events are visible to both hardware and software. Examples of such events include garbage collection, system calls, context switches, searches or manipulations of large data structures, and so on. After the completion of the event, it is desirable to return the cache to its state prior to the event's occurrence (execute a Restore operation). By checkpointing (executing a Record operation) the cache's tags at the event's initiation, a prefetching mechanism can restore the cache contents at the event's completion using the cache tags as a reference point. This in turn reduces the number of cache misses caused by the displacement of the application's working set during the event occurrence.

FIG. 3 illustrates a hardware cache implementation which has been augmented with logic to implement a checkpointed-tag prefetcher. A hardware cache with checkpointed-tag prefetcher contains one or more duplicate sets of cache tags 350, and a cache controller 310 that has been augmented with checkpointed-tag prefetching logic 312. The cache controller 310 according to an embodiment of the present invention includes the following two components: the checkpoint prefetcher (CP) operator of the checkpoint prefetching mechanism 300, and the CP controller needed to recognize event initiation/completion and control the CP operator. Note that these two components may be implemented as hardware, firmware, or software.

This cache mechanism 300 operates in one of two modes: normal mode or checkpointed mode. During normal operation, for each cache miss the checkpointed tag array 350 is written with the virtual tag address, while the conventional tag array 320 is written with the physical address (or virtual address in a virtually-tagged cache). Consequently, at any point in time during normal operation, the checkpointed tag array 350 summarizes the contents of the cache using virtual addresses.

When operating in checkpointed mode, the checkpoint tag array 350 is frozen, and subsequent updates to the cache are not reflected in the checkpoint tag array 350 in order to maintain the checkpoint tag array 350 in the state it was in at the initiation of the checkpoint request.
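The two modes can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than the described hardware: the class name, the use of dictionaries keyed by a block slot, and the method names are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    CHECKPOINTED = "checkpointed"


class DualModeTags:
    """Hypothetical model of tag array 320 and checkpoint tag array 350."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.tag_array = {}        # slot -> physical address (conventional tags)
        self.checkpoint_tags = {}  # slot -> virtual address (checkpoint tags)

    def on_miss(self, slot, physical_address, virtual_address):
        # The conventional tag array is updated in both modes.
        self.tag_array[slot] = physical_address
        # The checkpoint tag array tracks the cache only in normal mode;
        # in checkpointed mode it stays frozen.
        if self.mode is Mode.NORMAL:
            self.checkpoint_tags[slot] = virtual_address

    def checkpoint(self):
        # A checkpoint request simply freezes the checkpoint tag array.
        self.mode = Mode.CHECKPOINTED
```

After `checkpoint()` is called, a miss that replaces a slot changes `tag_array` but leaves the entry in `checkpoint_tags` as it was when the checkpoint was requested.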

In addition to the usual functions performed by the cache controller 310, this controller 310, with its augmented logic 312, will also respond to checkpoint and restore requests, as described in FIGS. 4 and 5. The decoder 340 provides a mapping between the identifiers kept in the checkpoint tag array 350 and the data arrays 330, just as it does with the regular tag array 320. Note that the identifiers could take the form of a virtual address. The decoder 340 would map or translate the virtual address to a physical address.

Checkpoint Prefetcher Operation.

Referring again to FIG. 3, in its simplest embodiment, the Checkpoint Prefetcher mechanism 300 contains CP prefetch logic 312 and a single set of duplicate cache tags 350. Other embodiments may contain additional sets of tags. When operating in normal mode, at the initiation of a potentially cache-thrashing event, the CP controller 310 initiates a “save” operation (a Record operation). The save operation entails copying the current cache tags 320 into the duplicate copy 350. Note that any updates made to the cache 330 will not be reflected in this duplicate copy 350 while the mechanism 300 is in Checkpoint mode. Updates to the tag checkpoint may be made in a lazy manner.

At the termination of a cache-thrashing event, the CP controller 310 initiates a “restore” operation. During a restore operation, the CP controller 310 issues a sequence of commands reloading the cache 330 with any cache blocks referenced in the duplicate tags 350 but absent from the cache 330. Because such operations are simple hints for performance improvement, they can be performed asynchronously, in the background, while the processor continues to execute instructions and use the cache 330. In this simple form, the checkpoint prefetcher 300 may restore the cache contents across a single cache-thrashing/displacement event.

In another embodiment, a set of several cache tag checkpoints may be saved, allowing the cache state to be saved and restored across multiple cache displacing events. For example, it would be necessary to save the cache state across context switches when more than two software threads share a processor, or a single thread exhibits recurring phases of execution involving more than two distinct phases.

In another embodiment, the Prefetcher 300 may contain a single set of duplicate tags with a memory-mapped interface by which software can read from or write to the duplicate tag memory. Software could use this interface to save an arbitrary number of tag checkpoints. As in the previous embodiment, a save operation will copy the current tag state to a duplicate set of tags. Afterwards, through memory-mapped reads, software can copy the duplicate tag memory to a buffer where it will reside until the cache contents should be restored from this checkpoint.

The restoration process involves reloading the duplicate tag memory from this buffer using memory-mapped writes, followed by the restore operation that will cause the prefetcher logic to reload the cache's contents based on the contents of the duplicate tag checkpoint memory. Through the addition of this memory-mapped checkpoint interface, the number of checkpoints that software may be able to reuse is unbounded. In addition to supporting an arbitrary number of contexts in a multi-programmed operating system or an arbitrary number of application phases, this interface would also be useful when performing process migration in a multiprocessor system.
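The software side of such a memory-mapped interface might look like the sketch below. It is illustrative only: `DuplicateTagMemory` stands in for the memory-mapped duplicate tag memory, and `CheckpointStore`, `stash`, and `reload` are hypothetical names for software that keeps one buffer per context or phase.

```python
class DuplicateTagMemory:
    """Stand-in for the memory-mapped duplicate tag memory (hypothetical)."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries

    def read(self, index):            # models a memory-mapped read
        return self.entries[index]

    def write(self, index, value):    # models a memory-mapped write
        self.entries[index] = value


class CheckpointStore:
    """Software-managed buffers holding any number of tag checkpoints."""

    def __init__(self, tag_memory):
        self.tag_memory = tag_memory
        self.buffers = {}  # key (e.g. a context or phase id) -> saved tags

    def stash(self, key):
        # After a save operation, copy the duplicate tags out to a buffer.
        count = len(self.tag_memory.entries)
        self.buffers[key] = [self.tag_memory.read(i) for i in range(count)]

    def reload(self, key):
        # Before a restore operation, write the buffer back to the duplicate tags.
        for index, value in enumerate(self.buffers[key]):
            self.tag_memory.write(index, value)
```

Because the buffers live in ordinary memory, the number of checkpoints is limited only by what software chooses to retain, which matches the unbounded-checkpoint property described above.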

Save/Restore Initiation and Thrashing Event Recognition.

Save and restore operations may be software-initiated or hardware-initiated. In a software-initiated implementation, the processor would include an interface by which programmers can initiate saves and restores of cache checkpoints, at any point that the programmer believes checkpoint prefetching may benefit. In a hardware-based implementation, logic embedded in the processor would initiate cache checkpoint saves and restores based on events that are visible within the processor, such as exception handling and phase detection. For example, at the occurrence of an exception, a checkpoint may be made of the current cache state. Upon return from the exception handler (occurrence of an rfi (return from interrupt) instruction in PowerPC) a restore operation would be initiated. Phase changes can be detected using simple monitoring hardware and signatures constructed to uniquely represent each phase. [see A. Dhodapkar and J. E. Smith, “Managing multi-configuration hardware via dynamic working set analysis,” Proceedings of the 29th International Symposium on Computer Architecture (ISCA-29), May 2002, and T. Sherwood, S. Sair, and B. Calder, “Phase Tracking and Prediction,” Proceedings of the 30th International Symposium on Computer Architecture (ISCA-30), June 2003].

Accordingly, a hardware-initiated checkpoint prefetcher could perform saves at the detection of phase changes and restores at the reversion to a previously detected phase. Although a software-based implementation should also be able to capture this behavior, in circumstances where software modification is not possible a hardware-initiated prefetcher can reap some of the benefits of checkpoint prefetching. A hybrid implementation may also be beneficial, which would allow a programmer to initiate saves and restores in software, but would revert to a predictive hardware-based initiation in the absence of programmer directives.
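One way to picture phase-driven initiation is the sketch below. It assumes that some external detector already supplies a phase identifier; how that identifier is computed is outside this example. The class `PhaseCheckpointer` and its method `on_phase` are hypothetical names, and the model keeps one saved checkpoint per phase.

```python
class PhaseCheckpointer:
    """Hypothetical model: save at the end of a phase, restore on its return."""

    def __init__(self):
        self.saved = {}      # phase id -> tuple of saved tag identifiers
        self.current = None  # phase currently executing

    def on_phase(self, phase_id, live_tags):
        """Report the phase now running and the tags currently in the cache.

        Returns the identifiers that a restore would prefetch (possibly empty).
        """
        if phase_id == self.current:
            return []  # no phase change, nothing to do
        if self.current is not None:
            # Save: checkpoint the outgoing phase's tags at the phase change.
            self.saved[self.current] = tuple(live_tags)
        self.current = phase_id
        # Restore: prefetch blocks the returning phase had that are now absent.
        live = set(live_tags)
        return [tag for tag in self.saved.get(phase_id, ()) if tag not in live]
```

For a periodic sequence such as ABAB, the first visit to each phase returns nothing to prefetch, while a later return to phase A yields the blocks that A held when it last ended and that have since been displaced.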

Cache Tag Prefetcher in Normal Mode.

When operating in normal mode, cache miss requests behave as in a conventional cache; however, the virtual address of the request is also written to the tag checkpoint array. Referring to FIG. 4 there is shown a flow chart of the cache management mechanism operating in normal mode, according to an embodiment of the present invention. The process begins in step 410 when the cache controller 310 receives a cache miss request or any other indication of a potential cache-thrashing event. In step 420 the controller 310 fetches a cache block from lower level memory. In step 430 the controller 310 writes data into the data array 330. Next in step 440 the address of the data and its metadata is written into the tag array 320. In the next step, 450, the process diverges from conventional mechanisms. The controller 310 writes the virtual address into the tag checkpoint array 350. The checkpointed tag array 350 is written with the virtual tag address, while the conventional tag array is written with the physical address (or virtual address in a virtually-tagged cache). Consequently, at any point in time during normal operation, the checkpointed tag array 350 summarizes the contents of the cache using the virtual addresses of the data blocks. In step 490 the process is complete.

In response to a checkpoint request, in step 460 the cache controller 310 simply transitions from normal mode to checkpointed mode. Restore requests performed during normal mode are ignored by the cache controller 310.

When operating in checkpoint mode, requests that result in a cache miss will be serviced by the cache controller as in normal mode; however, no address will be written to the checkpointed tag array 350. In response to a restore request, the cache controller 310 sequentially reads each address contained in the tag checkpoint array 350. For each of these addresses, address translation is performed by the decoder 340, and if the address is valid, the cache is referenced and a cache miss request is serviced if this reference results in a cache miss. Prefetches can be filtered by maintaining a thread-id as part of the tag-checkpoint array 350 in order to prevent cache pollution due to restoration of irrelevant cache data. If the thread-id of a tag checkpoint array entry is not equivalent to the currently running software thread-id, the prefetch is ignored. In this manner addresses that do not pertain to the running thread (i.e. if they were brought into the cache by a different software thread) will be filtered. Likewise, if address translation is not successful, the request is ignored and the next data block is processed.

FIG. 5 shows a flow chart of the cache management mechanism 300 operating in checkpoint mode, according to an embodiment of the present invention. The process begins at step 510 when the cache controller 310 receives a cache miss request. In step 520 the controller 310 fetches a cache block from lower level memory and in step 530 writes data into the data array 330. Additionally, in step 540 the controller 310 writes the address and metadata into the tag array 320. Processing is complete in step 590.

If the cache controller 310 receives a restore request in step 550 for address n=1, it fetches the address n from the tag checkpoint array 350 in step 555. Next it performs address translation for address n in step 560. If address n is valid in step 565, the controller 310 issues a prefetch for address n in step 570. Address invalidity may indicate unmatched thread-ids or an address referencing a data block that is in cache memory. Note that the checkpoint tag array 350 was frozen during the time the mechanism 300 is in checkpoint mode. Therefore, any updates to cache memory which might have occurred during this time period should not be overwritten by cache memory which was “frozen.”

If n equals the number of blocks in cache memory in step 575, then the restore request is complete and the controller 310 switches from checkpointed mode back to normal mode in step 580. Else, if n is less than the total number of blocks in cache memory, the processing returns to step 555 wherein the controller 310 sequentially reads each address contained in the tag checkpoint array 350 and continues until all of n blocks have been restored. Note that in checkpoint mode, no address will be written to the checkpointed tag array 350 in case of a cache miss. In step 565 if address translation is not successful, then in step 575 this block is skipped and processing continues.
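The restore loop of FIG. 5, including the thread-id filter and the translation check, can be sketched as follows. This is an illustrative model under stated assumptions: `CheckpointEntry`, `restore`, and the parameters `translate`, `cached`, and `prefetch` are hypothetical names, and address translation is modeled as a function that returns `None` on failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointEntry:
    virtual_address: int
    thread_id: int


def restore(checkpoint_tags, translate, cached, current_thread, prefetch):
    """Walk the frozen checkpoint tag array and return the next mode.

    translate: maps a virtual address to a physical address, or None on failure.
    cached:    set of physical block addresses already present in the cache.
    prefetch:  callback that issues a prefetch for a physical address.
    """
    for entry in checkpoint_tags:                    # step 555: fetch address n
        if entry is None or entry.thread_id != current_thread:
            continue                                 # empty slot or other thread
        physical = translate(entry.virtual_address)  # step 560: translation
        if physical is None or physical in cached:   # step 565: validity check
            continue                                 # skip and move to next block
        prefetch(physical)                           # step 570: issue prefetch
    return "normal"                                  # step 580: leave checkpoint mode
```

Skipping blocks that are already cached reflects the note above that updates made while the checkpoint was frozen should not be overwritten, and a failed translation simply moves the loop on to the next entry.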

Checkpoint and restore operations may be initiated by either software or hardware. When initiated by software, both the checkpoint and the restore requests are non-blocking, meaning that the processor initiating the requests may proceed with instruction processing immediately, without waiting for the checkpoint or restore operation to complete. Software may initiate a checkpoint or restore operation using special instructions or memory-mapped reads/writes.

A hybrid implementation may also be beneficial, which would allow a programmer to initiate saves and restores in software, but would revert to a predictive hardware-based initiation in the absence of programmer directives.

What has been shown and discussed is a highly-simplified depiction of a programmable computer apparatus. Those skilled in the art will appreciate that a variety of alternatives are possible for the individual elements, and their arrangement, described above, while still falling within the scope of the invention. Thus, while it is important to note that the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of signal bearing media include ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communication links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The signal bearing media may take the form of coded formats that are decoded for use in a particular data processing system.

Therefore, while there has been described what is presently considered to be the preferred embodiment, it will be understood by those skilled in the art that other modifications can be made within the spirit of the invention. The above descriptions of embodiments are not intended to be exhaustive or limiting in scope. The embodiments, as described, were chosen in order to explain the principles of the invention, show its practical application, and enable those with ordinary skill in the art to understand how to make and use the invention. It should be understood that the invention is not limited to the embodiments described above, but rather should be interpreted within the full meaning and scope of the appended claims.

1. A dual-mode prefetch system for implementing checkpoint tag prefetching, said system comprising: a data array storing data fetched from cache memory; a set of cache tags identifying the data stored in the data array; a checkpoint tag array storing virtual addresses and data identification information; and a cache controller with prefetch logic, said prefetch logic comprising: a checkpoint prefetch operator carrying out checkpoint prefetch instructions; and a checkpoint prefetch controller performing: recognizing initiation and completion events; controlling the checkpoint prefetch operator; and restoring cache memory to its state at a time when the initiation event occurs; wherein the cache system operates in normal mode and checkpoint mode; and wherein the checkpoint mode comprises freezing the checkpoint tag array such that subsequent updates to the cache memory are not reflected in said checkpoint tag array.
2. The dual-mode prefetch system of claim 1 further comprising a decoder implementing address translations.
3. The dual-mode prefetch system of claim 1 further comprising a phase detection mechanism performing: initiating a save operation upon detection of an end of a phase; and initiating a restore operation when said phase is resumed, wherein the restore operation restores cache memory to its state at the time when the end of the phase was detected.
4. The dual-mode prefetch system of claim 1 wherein the checkpoint tag array comprises a memory-mapped interface by which software can read from and write to a checkpoint tag memory.
5. A computer program product embodied on a tangible computer readable medium with computer-executable instructions stored therein, said computer-executable instructions comprising: storing data fetched from cache memory in a data array; identifying the data stored in the data array in a set of cache tags; storing virtual addresses and data identification information in a checkpoint tag array; and using prefetch logic comprising: carrying out checkpoint prefetch instructions; recognizing initiation and completion events; and restoring cache memory to its state at a time when the initiation event occurs; wherein the cache system operates in normal mode and checkpoint mode; and wherein the checkpoint mode comprises freezing the checkpoint tag array such that subsequent updates to the cache memory are not reflected in said checkpoint tag array.
6. The computer program product of claim 5 wherein the computer-executable instructions further comprise: implementing address translations.
7. The computer program product of claim 5 wherein the computer-executable instructions further comprise: initiating a save operation upon detection of an end of a phase; and initiating a restore operation when said phase is resumed, wherein the restore operation restores cache memory to its state at the time when the end of the phase was detected.
8. The computer program product of claim 5 wherein the checkpoint tag array comprises a memory-mapped interface by which software can read from and write to a checkpoint tag memory.